
Conversation


@airMeng airMeng commented Aug 13, 2025

This PR adds a chunked prefill op. It currently works with oneAPI 2025.1.

Llama-3B BF16 accuracy results, verified on BMG-12GB. You need to install SGLang following the instructions at https://github.com/airMeng/sglang/blob/xpu_attention/docs/platforms/xpu.md

| backend   | gsm-8k | mmlu  |
|-----------|--------|-------|
| intel_xpu | 0.680  | 0.545 |
| triton    | 0.650  | 0.546 |

To reproduce the accuracy results, launch the server first:

python3 -m sglang.launch_server  --model /PATH/TO/MODEL  --dtype bfloat16 --tp 1  --trust-remote-code  --mem-fraction-static 0.8 --attention-backend intel_xpu --page-size 128 --port 3000
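
Once the server is up, you can sanity-check it before running the benchmarks. A minimal probe, assuming the standard SGLang HTTP endpoints on the port chosen above (adjust the URL if your deployment differs):

```python
import requests

base_url = "http://localhost:3000"

# Liveness check against the server launched above.
print(requests.get(f"{base_url}/health").status_code)  # expect 200

# One generation request to confirm the intel_xpu attention backend runs end to end.
resp = requests.post(
    f"{base_url}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0.0},
    },
)
print(resp.json()["text"])
```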

Then run the accuracy scripts in SGLang:

cd ~/sglang/benchmark/gsm8k
python3 bench_sglang.py --num-questions 200 --port 3000
cd ../mmlu
python3 bench_sglang.py --nsub 20 --port 3000

The PR does not work with the current open-source oneAPI due to a SYCLCompat issue. You can update your local oneAPI according to intel/llvm#19673.

@airMeng airMeng marked this pull request as draft August 13, 2025 03:12
@airMeng airMeng marked this pull request as ready for review September 9, 2025 05:17
@airMeng airMeng force-pushed the cutlass_attntion branch 2 times, most recently from 4319c23 to 6ad98d8 Compare September 9, 2025 06:07
Comment on lines 175 to 205
if cu_seqlens_q is None:  # !is_varlen_q
    cu_seqlens_q = torch.arange(
        0, q.size(0) + 1, dtype=torch.int, device=q.device
    ) * q.size(1)
    max_seqlen_q = q.size(1)
    q = q.view(-1, q.size(-2), q.size(-1)).contiguous()
if cu_seqlens_k_new is None and k is not None:  # !is_varlen_k_new
    cu_seqlens_k_new = torch.arange(
        0, k.size(0) + 1, dtype=torch.int, device=k.device
    )
elif k is None:
    cu_seqlens_k_new = torch.zeros_like(
        cu_seqlens_q, dtype=torch.int32, device=q.device
    )
if cache_seqlens is not None:
    max_seqlen_k = cache_seqlens.max().item()
    assert cache_seqlens.size(0) + 1 == cu_seqlens_q.size(0)
    max_page_size_per_seq = page_table.size(1)
    num_pages_per_seq = torch.arange(
        0,
        cache_seqlens.size(0) * max_page_size_per_seq,
        max_page_size_per_seq,
        device=cache_seqlens.device,
    ).to(torch.int32)
    cu_seqlens_k = torch.concat(
        (
            torch.zeros(1, dtype=torch.int32, device=cache_seqlens.device),
            torch.cumsum(cache_seqlens, 0),
        )
    ).to(torch.int32)

Contributor

These ops cause a performance degradation compared to the Triton backend.

Collaborator

No worries, we are aware of this. This PR still needs a lot of changes.

Collaborator

No need to pay too much attention to it right now; it will be fixed later.
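
For reference, one direction for the fix is to build this length metadata once per batch and reuse it across attention calls, and to keep the `cache_seqlens.max()` reduction on device so the implicit host sync happens at most once per batch. A minimal sketch of that idea, with a hypothetical helper name that is not part of this PR:

```python
import torch

def build_seqlen_metadata(cache_seqlens: torch.Tensor, page_table: torch.Tensor):
    """Hypothetical helper: precompute per-batch metadata instead of rebuilding it per call."""
    device = cache_seqlens.device
    batch_size = cache_seqlens.size(0)
    max_pages = page_table.size(1)

    # Offsets into the flattened page table, one entry per sequence.
    num_pages_per_seq = torch.arange(
        0, batch_size * max_pages, max_pages, device=device, dtype=torch.int32
    )

    # Cumulative KV lengths with a leading zero, matching the code above.
    cu_seqlens_k = torch.nn.functional.pad(
        torch.cumsum(cache_seqlens, 0, dtype=torch.int32), (1, 0)
    )

    # Converting to a host int still syncs, but doing it once per batch
    # (rather than per attention call) amortizes the cost.
    max_seqlen_k = int(cache_seqlens.max())
    return num_pages_per_seq, cu_seqlens_k, max_seqlen_k
```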

@mingfeima mingfeima marked this pull request as draft September 15, 2025 07:44
std::optional<at::Tensor>& q_descale_, // (b, h_k), not (b, h)
std::optional<at::Tensor>& k_descale_, // (b, h_k)
std::optional<at::Tensor>& v_descale_, // (b, h_k)
std::optional<const at::Tensor>& page_table_, // (b_k, max_num_pages_per_seq)
Contributor

Why are we changing the function signature?

Comment on lines +275 to +287
if cu_seqlens_q is None:  # !is_varlen_q
    cu_seqlens_q = torch.arange(
        0, q.size(0) + 1, dtype=torch.int, device=q.device
    ) * q.size(1)
    max_seqlen_q = q.size(1)
    q = q.view(-1, q.size(-2), q.size(-1)).contiguous()
batch_size = cu_seqlens_q.numel() - 1
page_table = (
    torch.arange(0, batch_size, device=q.device)
    .to(torch.int32)
    .reshape([batch_size, 1])
    .contiguous()
)
Contributor

What extra functionality are we trying to provide here?

Collaborator Author

The current kernel implementation is aligned between vLLM and SGLang requests, so there will be some changes on the SGLang side.
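
To illustrate that alignment: for a dense (non-varlen) batch, the quoted code synthesizes a trivial page table with one page per sequence so the paged-KV kernel can be reused unchanged. A small sketch of what it produces, with hypothetical shapes chosen only for illustration:

```python
import torch

# Dense batch of 3 sequences of length 4: (batch, seqlen, heads, head_dim).
q = torch.randn(3, 4, 8, 64)

cu_seqlens_q = torch.arange(0, q.size(0) + 1, dtype=torch.int) * q.size(1)
print(cu_seqlens_q)   # tensor([ 0,  4,  8, 12], dtype=torch.int32)

batch_size = cu_seqlens_q.numel() - 1
page_table = torch.arange(0, batch_size).to(torch.int32).reshape([batch_size, 1])
print(page_table)     # tensor([[0], [1], [2]], dtype=torch.int32)
```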

#include "cutlass/util/device_memory.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/reference/device/gemm_complex.h"
#include "cutlass/util/reference/device/tensor_compare.h"
Collaborator

We don't need the header files for the verification code here.

if (params.page_table != nullptr && params.cu_seqlens_k != nullptr) {
  return run<true, true, cutlass::flash_attention::IndividualScheduler>(params);
} else {
  return 0;
Collaborator

Only use the paged-KV path?

CHECK_DEVICE(v_new);
TORCH_CHECK(k_new.stride(-1) == 1, "k_new tensor must have contiguous last dimension");
TORCH_CHECK(v_new.stride(-1) == 1, "v_new tensor must have contiguous last dimension");
int seqlen_k_new = !is_varlen_k_new ? k_new.size(1) : 1;
Collaborator

Should `seqlen_k_new` be 1, or 0?

at::Tensor out_accum, softmax_lse_accum;
auto outaccum_type = at::ScalarType::Float;

constexpr int PipelineStages = 0;
Collaborator

Set this to 2.

Comment on lines 451 to 454
#define CHECK_DEVICE(x) TORCH_CHECK(x.is_xpu(), #x " must be on XPU")
#define CHECK_SHAPE(x, ...) \
TORCH_CHECK(x.sizes() == torch::IntArrayRef({__VA_ARGS__}), #x " must have shape (" #__VA_ARGS__ ")")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
Collaborator

Move these to utils.


@mingfeima mingfeima marked this pull request as ready for review October 11, 2025 03:00
@mingfeima
Collaborator

@airMeng fix lint

@airMeng airMeng merged commit 1b3eb39 into main Oct 11, 2025
3 checks passed
sunjiweiswift added a commit to sunjiweiswift/sgl-kernel-xpu that referenced this pull request Oct 21, 2025
* initialize Cutlass support
Add chunked prefill op

---------

Co-authored-by: Swift.Sun <[email protected]>